ARCOMEM Crawling Architecture
Similar resources
ARCOMEM Crawling Architecture
The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limita...
The ARCOMEM Architecture for Social- and Semantic-Driven Web Archiving
The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into “community memories” that aim at building a better understanding of the public view on, e.g., celebrities, court decisions and other events. Due to the size of the Web, the traditional “...
An Architecture for Efficient Web Crawling
Virtual Integration systems require a crawling tool able to navigate and reach relevant pages in the Deep Web in an efficient way. Existing proposals in the crawling area fulfill some of these requirements, but most of them need to download pages in order to classify them as relevant or not. We propose a crawler supported by a web page classifier that uses solely a page URL to determine page re...
An Architecture for Selective Web Harvesting: The Use Case of Heritrix
In this paper we provide a brief overview of the crawling architecture of ARCOMEM and how it addresses the challenges arising in the context of selective web harvesting. We describe some of the main technologies developed to perform selective harvesting and we focus on a modified version of the open source crawler Heritrix, which we have adapted to fit in ARCOMEM’s crawling architecture. The si...
A Novel Architecture of Agent based Crawling for OAI Resources
Most search engines compete to index as much of the Surface Web as possible while leaving OAI content (PDF documents) in the lurch, even though it holds more information than the Surface Web. This paper proposes a novel framework for an OAI-PMH based crawler that uses agents to extract metadata about OAI resources and store it in a repository which is ...
Journal
Journal title: Future Internet
Year: 2014
ISSN: 1999-5903
DOI: 10.3390/fi6030518